Security Code Review by LLMs: A Deep Dive into Responses
Yu, Jiaxin, Liang, Peng, Fu, Yujia, Tahir, Amjed, Shahin, Mojtaba, Wang, Chong, Cai, Yangxiao
Security code review combines automated tools and manual effort to detect security defects during development. The rapid advance of Large Language Models (LLMs) has shown promising potential in software development and opened up new possibilities for automated security code review. To explore the challenges of applying LLMs to practical code review for security defect detection, this study compared the detection performance of three state-of-the-art LLMs (Gemini Pro, GPT-4, and GPT-3.5) under five prompts on 549 code files that contain security defects from real-world code reviews. By analyzing 82 responses generated by the best-performing LLM-prompt combination on 100 randomly selected code files, we extracted and categorized the quality problems in these responses into 5 themes and 16 categories. Our results indicate that LLM-generated responses often suffer from verbosity, vagueness, and incompleteness, highlighting the need to improve their conciseness, understandability, and compliance with security defect detection. This work reveals the deficiencies of LLM-generated responses in security code review and paves the way for future optimization of LLMs for this task.
Towards security defect prediction with AI
Sestili, Carson D., Snavely, William S., VanHoudnos, Nathan M.
In this study, we investigate the limits of the current state-of-the-art AI system for detecting buffer overflows and compare it with current static analysis tools. To do so, we developed a code generator, sbAbI, capable of producing an arbitrarily large number of code samples of controlled complexity. We found that the static analysis engines we examined have good precision but poor recall on this dataset, except for a sound static analyzer that has both good precision and good recall. We found that the state-of-the-art AI system, a memory network modeled after Choi et al. [1], can achieve performance similar to the static analysis engines, but requires an exhaustive amount of training data to do so. Our work points towards future approaches that may solve these problems; namely, using representations of code that can capture appropriate scope information and using deep learning methods that are able to perform arithmetic operations.

Predicting security defects in source code is of significant national security interest. Ideally, security defects are detected during development, before the code is ever run to expose them. The current best methods for finding security defects before running code are static analysis tools, a variety of which exist and model software in different ways, each useful for different kinds of flaws. Developers of static analyzers carefully equip them with rules about program behavior, which are used to reason about the safety of the program if it were to run. However, static analyzers are known to be insufficient at finding flaws. The Juliet Test Suite [2]-[4] is a collection of synthetic code containing intentional security defects across hundreds of vulnerability classes in the Common Weakness Enumeration standard, labeled at the line-of-code level. Even state-of-the-art static analyzers perform poorly at finding the defects in Juliet, issuing too many false positives and too many false negatives [5]-[8].